Extraction of Chemical and Drug Named Entities by Ensemble Learning Using Chemical NER Tools Based on Different Extraction Guidelines

Authors

  • Thaer M. Dieb
  • Masaharu Yoshioka
Abstract

Chemical named-entity recognition (chemical NER) is the task of extracting chemical information and chemical-related entities, such as drug names and source materials, from text in domains such as bioinformatics and nanoinformatics. There have been several attempts to construct corpora for handling such chemical-related information, each based on different corpus-construction guidelines. Even though these guidelines cover common types of chemical information, they differ in several ways. As a result, chemical NER tools developed for a particular guideline might be able to extract common chemical named entities, but they may have problems extracting other chemical-related entities. Assuming the differences between these guidelines are consistent, the pattern of success and failure of the chemical NER tools might also be consistent. In this paper, we present an ensemble-learning approach that uses the conditional random field (CRF) as a machine-learning technique to fuse a variety of chemical NER tools with different characteristics, each based on a different guideline, to construct a chemical NER system for a particular guideline. To achieve consistent tokenization across these different tools, we applied a post-tokenization mechanism. We evaluated the system using the BioCreative IV CHEMDNER task datasets. We confirmed that the ensemble-learning approach using a combination of chemical NER tools is better than a simple domain-adaptation approach using just one chemical NER tool. We also confirmed that the ensemble-learning approach could improve the performance of a well-tuned rule-based chemical NER tool on certain tasks.
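The fusion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tool names (`tool_a`, `tool_b`), the regular-expression tokenizer, and the feature layout are all assumptions made for illustration. The sketch re-tokenizes the text once, projects each tool's character-level spans onto those shared tokens as B/I/O tags, and collects the tags as per-token features of the kind a CRF trained on the target guideline could consume.

```python
import re

def tokenize(text):
    """Split text into fine-grained tokens with character offsets.
    Hypothetical tokenizer: runs of letters, runs of digits, or single symbols."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)]

def bio_tags(tokens, spans):
    """Project one tool's (start, end) character spans onto the shared tokens."""
    tags = []
    for _, start, end in tokens:
        tag = "O"
        for s, e in spans:
            if start < e and end > s:  # token overlaps this span
                # The first token of a span starts at or before the span start.
                tag = "B" if start <= s else "I"
                break
        tags.append(tag)
    return tags

def ensemble_features(text, tool_spans):
    """Build one feature dict per token, holding every tool's B/I/O tag."""
    tokens = tokenize(text)
    per_tool = {name: bio_tags(tokens, spans) for name, spans in tool_spans.items()}
    features = []
    for i, (word, _, _) in enumerate(tokens):
        feats = {"token": word, "lower": word.lower()}
        for name, tags in per_tool.items():
            feats[name] = tags[i]
        features.append(feats)
    return tokens, features

# Illustrative input: two hypothetical tools that disagree about "COX-2".
text = "Aspirin inhibits COX-2 activity."
tool_spans = {
    "tool_a": [(0, 7)],            # marks only "Aspirin"
    "tool_b": [(0, 7), (17, 22)],  # marks "Aspirin" and "COX-2"
}
tokens, features = ensemble_features(text, tool_spans)
for feats in features:
    print(feats)
```

Because every tool's output is mapped onto the same token sequence, disagreements between tools become ordinary per-token features, and a sequence model can learn which tool to trust for which kind of entity under the target guideline.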


Similar resources

Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...


A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...


PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...


Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In ...


A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations

Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessible in order to help dietitians follow the new knowledge that arrives daily with newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They are focused on, for ex...



Journal title:
  • Trans. MLDM

Volume 8

Publication date: 2015